Abstract Artificial neural networks are simple and efficient
machine learning tools. Defined originally in the traditional
setting of simple vector data, neural network models have
evolved to address more and more difficulties of complex
real world problems, ranging from time evolving data to
sophisticated data structures such as graphs and functions.
This paper summarizes advances on those themes from the
last decade, with a focus on results obtained by members of
the SAMM team of Université Paris 1.



1 Introduction

In many real world applications of machine learning and
related techniques, the raw data are no longer in a standard
and simple tabular format in which each object is described
by a common and fixed set of numerical attributes. This
standard vector model, while useful and efficient, has some
obvious limitations: it is restricted to numerical attributes and
cannot handle objects with non uniform descriptions (e.g.,
situations in which some objects have a richer description
than others), relations between objects (e.g., persons involved
in a social network), etc.

In addition, it is quite common for real world applications
to have some dynamic aspect, in the sense that the data
under study are the results of a temporal process. Then, the
traditional hypothesis of statistical independence between
observations no longer holds: new hypotheses and
theoretical analyses are needed to justify the mathematical
soundness of the machine learning methods in this context.

Artificial neural networks provide some of the most efficient
techniques for machine learning and data mining [43].
Like other solutions, they were mainly developed to handle
vector data and analyzed theoretically in the context of
statistically independent observations. However, the last decade
has seen numerous efforts to overcome those two limitations
ET1 . We survey in this article some of the resulting solutions.
We focus our attention on the two major artificial neural
network models: the Multi-Layer Perceptron (MLP) and the
Self-Organizing Map (SOM).

2 Multi-Layer Perceptrons

The Multi-Layer Perceptron (MLP) is one of the best known
artificial neural network models (see e.g., [5]). From a
statistical point of view, MLPs can be considered as a parametric
family of regression functions. Technically, if the data
set consists of vector observations in $\mathbb{R}^p$, that is if each object
is described by a vector $x = (x_1, \dots, x_p)^T$, the output of a one
hidden layer perceptron with $k$ hidden neurons is given by

$$F_\theta(x) = \beta + \sum_{i=1}^{k} a_i \psi(w_i^T x + b_i), \qquad (1)$$

where the $w_i$ are vectors of $\mathbb{R}^p$, and $\beta$, the $a_i$ and the $b_i$ are
real numbers ($\theta$ denotes the vector of all parameters, obtained
by concatenating the $w_i$, $a_i$ and $b_i$). In this equation, $\psi$ is a
bounded transfer function which introduces some nonlinearity
in $F_\theta$. Given a set of training examples, that is $N$ pairs
$(X_i, Y_i)$, the learning process consists in minimizing over $\theta$ a
distance between $Y_i$, the target value, and $F_\theta(X_i)$, the predicted
value. Given an error criterion (such as the mean squared
error), an optimal value for $\theta$ is determined by any optimization
algorithm (such as quasi-Newton methods, see e.g. Q),
leveraging the well known backpropagation algorithm [62],
which enables a fast computation of the derivatives of $F_\theta$ with
respect to $\theta$. The use of a one hidden layer perceptron model
is motivated by approximation results such as [26] and by
learnability results such as [63] (in the statistical community,
learning is called estimation and learnability is called consistency).
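
As an illustration of equation (1), the following minimal sketch evaluates a one hidden layer perceptron and its mean squared error with NumPy. The function names, the choice of tanh as the bounded transfer function and the random parameter values are assumptions made for the example; no training is performed.

```python
import numpy as np

def mlp_output(x, W, b, a, beta, psi=np.tanh):
    """One hidden layer perceptron output, following equation (1).

    x: input vector of shape (p,); W: (k, p) matrix whose rows are the w_i;
    b: (k,) biases b_i; a: (k,) output weights a_i; beta: scalar offset.
    psi defaults to tanh, one possible bounded transfer function.
    """
    hidden = psi(W @ x + b)          # psi(w_i^T x + b_i) for each hidden neuron
    return float(beta + a @ hidden)  # beta + sum_i a_i * hidden_i

def mean_squared_error(X, Y, W, b, a, beta):
    """Mean squared error over N training pairs (X_i, Y_i)."""
    predictions = np.array([mlp_output(x, W, b, a, beta) for x in X])
    return float(np.mean((np.asarray(Y) - predictions) ** 2))

# Toy usage with arbitrary (untrained) parameters.
rng = np.random.default_rng(0)
p, k, N = 3, 4, 10
W, b, a, beta = rng.normal(size=(k, p)), rng.normal(size=k), rng.normal(size=k), 0.5
X, Y = rng.normal(size=(N, p)), rng.normal(size=N)
print(mean_squared_error(X, Y, W, b, a, beta))
```

In practice the parameters would be obtained by minimizing this error with an optimization algorithm, using backpropagation for the derivatives.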



2.1 Model selection issues for MLP 

It is well known since the seminal paper of ll3"Tl that MLPs are
an efficient solution for modeling time series whenever the
linear model proves to be inadequate. The simplest approach
consists in building a nonlinear auto-regressive model: given
a real valued time series $(Y_t)_{t \in \mathbb{N}}$, one builds training pairs
$Z_t = (U_t, Y_t)$, where $U_t$ is a vector in $\mathbb{R}^p$ defined by
$U_t = (Y_{t-1}, \dots, Y_{t-p})$. Then an MLP is used to learn the mapping
between $U_t$ (the past of the time series in a time window
of length $p$) and $Y_t$ (the current value of the time series), as
in any regression problem.
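
The construction of the training pairs $(U_t, Y_t)$ can be sketched as follows. This is only an illustration: the function name and the array layout are assumptions, and any regression model could then be fitted on the resulting pairs.

```python
import numpy as np

def make_autoregressive_pairs(y, p):
    """Build pairs (U_t, Y_t) with U_t = (Y_{t-1}, ..., Y_{t-p}).

    y: one dimensional time series; p: length of the time window.
    Returns U of shape (len(y) - p, p) and the matching targets.
    """
    y = np.asarray(y, dtype=float)
    if p < 1 or p >= len(y):
        raise ValueError("p must satisfy 1 <= p < len(y)")
    # y[t - p:t] holds (Y_{t-p}, ..., Y_{t-1}); reverse it to match U_t.
    U = np.array([y[t - p:t][::-1] for t in range(p, len(y))])
    targets = y[p:]
    return U, targets

U, targets = make_autoregressive_pairs([0.0, 1.0, 2.0, 3.0, 4.0], p=2)
print(U)        # each row is (Y_{t-1}, Y_{t-2})
print(targets)  # the corresponding Y_t
```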

In order to avoid overfitting and/or large computation
times, the question of selecting the correct number of neurons
or, more generally, the question of model selection arises
immediately. Standard methods used by the neural network
community are based on pruning: one trains a possibly too
large MLP and then removes useless neurons and/or connection
weights. Heuristic solutions include Optimal Brain
Damage [32] and Optimal Brain Surgeon [24], but a statistically
founded method, SSM (Statistical Stepwise Method),
was introduced by ifTTl . The method relies on the minimization
of the Bayesian Information Criterion (BIC). Shortly
after, [64] and [55] proved the consistency (almost surely)
of BIC in the case of MLPs with one hidden layer. These
results, established for time series, generalize the
consistency results of [63] for the iid case.

The convergence properties of BIC may be generalized
even further. A first extension is given in [53]. The noise
is assumed to be Gaussian and the transfer function $\psi$ is
assumed to be bounded and three times differentiable. Then [53]
shows that, under some mild hypotheses, the maximum of the
likelihood-ratio test statistic (LRTS) converges toward the
maximum of the square of a Gaussian process indexed by a
class of limit score functions. The theorem establishes the
tightness of the likelihood-ratio test statistic and, in particular,
the consistency of penalized likelihood criteria such as BIC.
Some practical applications of such methods can be found in
[38]. The hypothesis on the noise was relaxed in [54]: the
noise is no longer assumed to be Gaussian, but only to admit
exponential moments. Under this more general assumption,
the BIC criterion is still consistent (in probability).

On the basis of the theoretical results above, a practical
procedure for MLP identification is proposed. For a one hidden
layer perceptron with $k$ hidden units, we first introduce

$$T_n(k) = \min_\theta \left( E_n(\theta) + a_n(k, \theta) \right),$$

where $E_n(\theta)$ is the mean squared error of the MLP for
parameter $\theta$ and $a_n(k, \theta)$ is a penalty term. Then we proceed
as follows:

1. Determination of the right number of hidden units.

(a) begin with one hidden unit, compute $T_n(1)$,

(b) add one hidden unit if $T_n(k+1) < T_n(k)$,

(c) if $T_n(k+1) > T_n(k)$ then stop and keep $k$ hidden units
for the model.

2. Prune the weights of the MLP using classical techniques
like SSM IfTTl .
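
Step 1 of the procedure above can be sketched as a simple loop. In this illustration, `criterion` is a hypothetical callable returning $T_n(k)$ (in practice it would train an MLP with $k$ hidden units and add the penalty), `k_max` is an assumed safety bound, and ties are treated as a stopping condition, which the text does not specify.

```python
def select_hidden_units(criterion, k_max=50):
    """Increase the number of hidden units while T_n(k+1) < T_n(k).

    criterion: callable returning the penalized criterion T_n(k).
    k_max: assumed upper bound on the number of hidden units.
    """
    k = 1
    current = criterion(k)
    while k < k_max:
        candidate = criterion(k + 1)
        if candidate < current:
            # The larger model improves the penalized criterion: keep it.
            k += 1
            current = candidate
        else:
            # No improvement: stop and keep k hidden units.
            break
    return k

# Toy criterion values standing in for T_n(1), ..., T_n(5).
toy_values = {1: 5.0, 2: 3.0, 3: 2.5, 4: 2.7, 5: 1.0}
print(select_hidden_units(lambda k: toy_values[k], k_max=5))
```

The search is greedy: it stops at the first value of $k$ that does not improve the criterion, so later improvements (such as the toy value for $k = 5$) are never examined.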

Note that the choice of the penalty term $a_n(k, \theta)$ is very
important. On simulated data, good results have been reported
for $a_n(k, \theta)$ from a n (k, 9) = g " (8) n ' og(n) to a n (k, 9) =
E J^hM (see [55], ED).

Let us also mention that the tightness of the LRTS and, in
particular, the consistency of the BIC criterion were recently
established for more complex neural network models such
as mixtures of MLPs PT| and mixtures of experts [42].

2.2 Modeling and forecasting nonstationary time series 

As mentioned in the previous section, MLPs are a useful
tool for modeling time series. However, most of the results
cited above are available for iid data or for stationary time
series. In order to deal with highly nonlinear or nonstationary
time series, a hybrid model involving hidden Markov models
(HMM) and multilayer perceptrons (MLP) was
proposed in [50]. Let us consider $(X_t)_{t \in \mathbb{N}}$, a homogeneous
Markov chain valued in a finite state space $E = \{e_1, \dots, e_N\}$,
and $(Y_t)_{t \in \mathbb{N}}$, the observed time series. The hybrid HMM/MLP
model can be written as follows:

$$Y_{t+1} = F_{X_{t+1}}(Y_t, \dots, Y_{t-p+1}) + \sigma_{X_{t+1}} \varepsilon_{t+1}, \qquad (2)$$

where $F_{X_{t+1}} \in \{F_{e_1}, \dots, F_{e_N}\}$ is a regression function of order
$p$. In this case, $F_{e_i}$ is the $i$-th MLP of the model, parameterized
by the weight vector $w_i$. $\sigma_{X_{t+1}} \in \{\sigma_{e_1}, \dots, \sigma_{e_N}\}$ is a strictly
positive number and $(\varepsilon_t)_{t \in \mathbb{N}}$ is an iid sequence of standard
Gaussian variables.
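
To make equation (2) concrete, the sketch below simulates a series from a two state hybrid model. It is an illustration under assumptions: the transition matrix, the random MLP weights, the noise levels and the initial conventions (state 0 and zero values for the first $p$ steps) are all chosen for the example, and no estimation is performed.

```python
import numpy as np

def make_mlp(W, b, a, beta):
    """Return a one hidden layer perceptron as a function of the past window."""
    return lambda u: float(beta + a @ np.tanh(W @ u + b))

def simulate_hmm_mlp(T, P, regressors, sigmas, p, rng):
    """Simulate equation (2); y[t] plays the role of Y_{t+1}.

    P: (N, N) transition matrix of the hidden chain (rows sum to 1);
    regressors: list of N functions F_{e_i}; sigmas: N positive noise levels.
    """
    n_states = P.shape[0]
    states = np.zeros(T, dtype=int)   # assumed convention: start in state 0
    y = np.zeros(T)                   # assumed convention: first p values are 0
    for t in range(p, T):
        # Draw the next hidden state from the row of the current state.
        states[t] = rng.choice(n_states, p=P[states[t - 1]])
        past = y[t - p:t][::-1]       # the p most recent values, newest first
        s = states[t]
        y[t] = regressors[s](past) + sigmas[s] * rng.standard_normal()
    return states, y

rng = np.random.default_rng(1)
p = 2
P = np.array([[0.95, 0.05], [0.10, 0.90]])
regressors = [make_mlp(rng.normal(size=(3, p)), rng.normal(size=3),
                       rng.normal(size=3), 0.0) for _ in range(2)]
sigmas = np.array([0.1, 1.0])
states, y = simulate_hmm_mlp(200, P, regressors, sigmas, p, rng)
print(states[:10], y[:5])
```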

The estimation procedure as well as the statistical properties
of the parameter estimates were established in [51]. The
proposed model was successfully applied to difficult
data sets such as ozone peaks [14] or financial shocks
E0-

2.3 Functional data 

The original MLP model is limited to vector data for an
obvious reason: each neuron computes its output as a nonlinear
transformation $\psi$ applied to a (shifted) inner product
$w^T x + b$ (see equation (1)). However, as first pointed out in
[56], this general formula applies to any data space on which
linear forms can be defined: given a data space $\mathcal{X}$ and a set
$\mathcal{W}$ of linear functions from $\mathcal{X}$ to $\mathbb{R}$, one can define a general
neuron with the help of $w \in \mathcal{W}$, as calculating $\psi(w(x) + b)$.

This generalization is particularly suitable for functional
data, that is for data in which each object is described by one
or several functions [45]. This type of data is quite common,
for instance in multiple time series settings (where each object
under study evolves through time and is described by the
temporal evolutions of its characteristics) or in spectrometry.
A functional neuron [46] can then be defined as calculating
$\psi(b + \int f w \, d\mu)$, where $f$ is the observed function and $w$ is a
parameter function. Results in [46] show that MLPs based on
this type of neurons share many of the interesting properties
of classical MLPs, from universal approximation to statistical
consistency (see also [47] for an alternative functional
neuron with similar properties). In addition, the parameter
functions $w$ can be represented by standard numerical MLPs,
leading to a hierarchical solution in which a top level MLP
for functional data is obtained by using a numerical MLP
in each of its functional neurons. Experimental results in
[46, 48] show the practical relevance of this technique.
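
A functional neuron can be approximated numerically when the functions are observed on a grid. The sketch below is only an illustration: it assumes that $\mu$ is the Lebesgue measure on an interval, that $f$ and $w$ are sampled on a common grid, and it uses the trapezoidal rule and tanh as arbitrary choices.

```python
import numpy as np

def trapezoid(values, grid):
    """Trapezoidal approximation of the integral of sampled values over grid."""
    widths = np.diff(grid)
    return float(np.sum(widths * (values[:-1] + values[1:]) / 2.0))

def functional_neuron(f_values, w_values, grid, b, psi=np.tanh):
    """Approximate psi(b + integral of f * w) from samples on a common grid."""
    return float(psi(b + trapezoid(f_values * w_values, grid)))

grid = np.linspace(0.0, 1.0, 101)
f_values = grid                 # observed function f(t) = t
w_values = np.ones_like(grid)   # parameter function w(t) = 1
print(functional_neuron(f_values, w_values, grid, b=0.0))
```

In the hierarchical solution described above, `w_values` would be produced by evaluating a numerical MLP on the grid instead of being fixed.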

3 Self-Organizing Maps 

Like the MLP, Kohonen's Self-Organizing Map (SOM) is one
of the best known artificial neural network models [29].
The SOM is a clustering and visualization model in which a
set of vector observations in $\mathbb{R}^p$ is mapped to a set of $M$ neurons
organized in a low dimensional prior structure, mainly
a two dimensional grid or a one dimensional string. Each
neuron $c$ is associated to a codebook vector $p_c$ in $\mathbb{R}^p$ ($p_c$ is
also called a prototype). As in all prototype based clustering
methods, each $p_c$ represents the data points that have been
assigned to the corresponding neuron, in the sense that $p_c$ is
close to those points (according to the Euclidean distance in
$\mathbb{R}^p$). The distinctive feature of the SOM is that each prototype
$p_c$ is also somewhat representative of data points assigned to
other neurons, based on the geometry of the prior structure:
if neurons $c$ and $d$ are neighbours in the prior structure, then
$p_c$ will be close to data points assigned to neuron $d$ (and
vice versa). On the contrary, if $c$ and $d$ are far away from
each other in the prior structure, the data points assigned to
one neuron will not influence the prototype of the other neuron.
This has some very important consequences in terms of
visualization capabilities, as illustrated in [60] for instance.

The original SOM algorithm has been designed for vector
data, but numerous adaptations to more complex data have
been proposed. We survey here three specific extensions,
respectively to time series, functional data and categorical data.
Another important extension not covered here is proposed in
Ell which is built upon processing of multiple time series
with recursive versions of the SOM. The authors show that
trees and graphs can be clustered by those versions of the
SOM, using a temporal coding of the structure. Recent advances
in this line of research include e.g. [19]. Other specific
adaptations include the symbol strings SOM described in [59].

3.1 Time series with metadata 

While the SOM is a clustering algorithm, it has frequently
been used in supervised contexts as a component of a complex
model. We briefly describe here one such model as an
example of complex time series processing with the SOM.
Let us consider a time series with two time scales, i.e., one that
can be written with two subscripts. The date is denoted
by $(j, h)$, where $j$ represents the slow time scale and
corresponds for instance to the day (or month or year), while
$h = 1, \dots, H$ corresponds to the observed values (e.g. the
hours or half-hours of the day, the days of the month, the
months of the year, etc.). Then the time series is denoted
$(c_j)_{j \geq 0} = ((c_{j,1}, \dots, c_{j,H}))_{j \geq 0}$. We assume in addition that
the slow time scale is associated with metadata: for instance,
if each $j$ corresponds to a day in a year, one knows the
day of the week, the month, etc. Metadata are assumed to be
available prior to a prediction.

The original time series $c_{j,h}$ takes values in $\mathbb{R}$, but the
dual time scale leads naturally to a vector valued time series
representation, that is to the $c_j \in \mathbb{R}^H$. From this point of view,
given the past of the vector valued time series, one has to
predict a future vector value, that is a complete vector of $H$
values. This could be seen as a long term forecasting problem
for which a usual solution would be to iterate one-step ahead
forecasts. However, this generally leads to unsatisfactory
solutions, because of either a squashing behaviour (convergence
of the forecasts to the mean value of the series) or a chaotic
behaviour (for nonlinear methods).

An alternative solution is explored in [12]. It consists
in forecasting separately, on the one hand, the mean and
variance of the time series on the next slow time scale step (that
is, on the next $j$), and on the other hand, the profile of the fast
time scale. The prediction of the mean and of the variance is
done by any classical technique. For the profile, a SOM is
used as follows. The vector values of the time series, i.e., the
$(c_j)_{j \geq 0}$, are centred and normalized with respect to the fast
time scale, that is are transformed into profiles defined by

$$\theta_j = \frac{1}{\sigma_j} \left( c_{j,1} - \mu_j, \dots, c_{j,H} - \mu_j \right), \qquad (3)$$

where $\mu_j = \frac{1}{H} \sum_{h=1}^{H} c_{j,h}$ and $\sigma_j^2 = \frac{1}{H} \sum_{h=1}^{H} (c_{j,h} - \mu_j)^2$ are
respectively the mean and the variance of $c_j$. The profiles are
clustered with a SOM leading to some prototype profiles $p_c$.
Each prototype is associated to the metadata of the profiles
that have been assigned to the corresponding neuron.
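
The profile transformation of equation (3) and its inverse (the rescaling used at prediction time) can be sketched as follows. The function names are hypothetical, the variance is assumed to be normalized by $H$, and rows with zero variance are not handled.

```python
import numpy as np

def to_profiles(C):
    """Centre and normalize each row c_j of C (shape (J, H)), as in equation (3).

    Returns the profiles together with the means mu_j and standard deviations sigma_j.
    Assumes every row has a non zero variance.
    """
    C = np.asarray(C, dtype=float)
    mu = C.mean(axis=1)
    sigma = C.std(axis=1)   # population standard deviation (division by H)
    profiles = (C - mu[:, None]) / sigma[:, None]
    return profiles, mu, sigma

def from_profile(profile, mu, sigma):
    """Rescale a profile with a forecast mean and standard deviation."""
    return mu + sigma * np.asarray(profile, dtype=float)

C = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
profiles, mu, sigma = to_profiles(C)
print(profiles)  # both rows share the same shape, hence the same profile
print(from_profile(profiles[1], mu[1], sigma[1]))
```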

Then a vector value is predicted as follows: the mean $\mu$
and variance $\sigma^2$ are obtained by a standard forecasting model
for the slow time scale. Then the metadata of the vector to
predict is matched against the metadata associated to neurons:
assume for instance that metadata are days of the week, and
that we try to predict a Sunday. Then one collects all the
neurons to which Sunday profiles have been assigned. Finally,
a weighted average of the matching prototypes is computed
and rescaled according to $\mu$ and $\sigma$. As shown in ifPUl this
technique enables stable and meaningful full day
predictions, while integrating non numerical metadata.



3.2 Functional data

The dual time scale approach described in the previous section
has become a standard way of dealing with time series
in a functional way, as shown in e.g. J4 ']. But as pointed out
in Section 2.3, functional data arise naturally in other contexts
such as spectrometry. The SOM has therefore been
adapted to functional data in contexts other than time series.
In those contexts, in addition to the normalization technique
described above that produces profiles, one can use functional
transformations such as derivative calculations in order
to drive the clustering process by the shapes of the functions
rather than mainly by their average values [49].

Another adaptation consists in integrating the SOM with
optimal segmentation techniques that represent functions or
time series with simple models, such as piecewise constant
functions. The main idea is to apply a SOM
to functional data using any functional distance (from the
$L^2$ norm to more advanced Sobolev norms jrjTl ) with the
additional constraint that prototypes must be simple, e.g.,
piecewise constant. This leads to interesting visualization
capabilities in which the complexity of the display is
automatically globally adjusted [25].


3.3 Categorical data

In surveys, it is quite standard that the collected answers are
categorical variables with a finite number of possible values.
In this case, a specific adaptation of the SOM algorithm can
be defined, in the same way that Multiple Correspondence
Analysis is related to Principal Component Analysis. More
precisely, useful encoding methods for categorical data are
the Burt Table (BT), which is the full contingency table between
all pairs of categories of the variables, or the Complete
Disjunctive Table (CDT), which contains the answers of each
individual coded as 0/1 against dummy variables that correspond
to all the categories of all variables. Then, a Multiple
Correspondence Analysis of the BT or of the CDT is nothing
else than a Principal Component Analysis on the BT or CDT,
previously transformed to take into account a specific distance
between the rows and a weighting of the individuals.
The SOM can be adapted to categorical data using this
approach, as described in [ 1 1 and [13]. The same transformation
is applied to the BT or CDT, and a SOM using the rows
of the transformed tables can thus be trained. This training
provides an organized clustering of all the possible values
of the categorical variables on a prior structure such as a
two dimensional grid. Moreover, if a simultaneous representation
of the individuals and of the values is needed, two
coupled SOMs can be trained and superimposed. The
aforementioned articles present various real-world use cases from
the socio-economic field.
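
The two encodings of categorical data mentioned in this section, the Complete Disjunctive Table and the Burt Table, can be built as in the following sketch. The function names and the toy survey are assumptions made for the example; the specific row distance and weighting used by Multiple Correspondence Analysis are not implemented here.

```python
import numpy as np

def complete_disjunctive_table(answers):
    """Code each individual as 0/1 against all categories of all variables.

    answers: list of tuples, one tuple of category labels per individual.
    Returns the CDT and the list of (variable index, category) column labels.
    """
    n_vars = len(answers[0])
    columns = []
    for v in range(n_vars):
        for category in sorted({row[v] for row in answers}):
            columns.append((v, category))
    cdt = np.array([[1 if row[v] == category else 0 for (v, category) in columns]
                    for row in answers])
    return cdt, columns

def burt_table(cdt):
    """Contingency table between all pairs of categories: CDT^T CDT."""
    return cdt.T @ cdt

# Toy survey: two questions, three individuals.
answers = [("yes", "a"), ("no", "a"), ("yes", "b")]
cdt, columns = complete_disjunctive_table(answers)
print(columns)
print(cdt)
print(burt_table(cdt))
```

Each row of the CDT sums to the number of variables, and the diagonal of the Burt Table holds the frequency of each category.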



4 Kernel and dissimilarity SOM 

The extensions of artificial neural network models described
in the previous sections are ad hoc in the sense that they are
constructed using specific features of the data at hand. This
is a strength but also a limitation as they are not universal:
given a new data type, one has to design a new adaptation
of the general technique. In the present section, we present
more general versions of the SOM that are based on a
dissimilarity or a kernel on the input data. Assuming the existence
of such a measure is far weaker than assuming the data are
in a vector format. For instance, it is simple to define a
dissimilarity/similarity between the vertices of a graph, a data
structure that is very frequent in real world problems [40],
while representing directly those vertices as vectors is
generally difficult.



4.1 Dissimilarity SOM 

Let us assume that the data under study belong to a set $\mathcal{X}$
on which a dissimilarity $d$ is defined: $d$ is a function from
$\mathcal{X} \times \mathcal{X}$ to $\mathbb{R}^+$ that maps a pair of objects $x$ and $y$ to a non
negative real number which measures how different $x$ and
$y$ are. Hypotheses on $d$ are minimal: it has to be symmetric
($d(x, y) = d(y, x)$) and such that $d(x, x) = 0$.

As pointed out above, dissimilarities are readily available
on sets of non vector data. A classical example is the
string edit distance [34], which defines a distance on symbol
strings (a distance is a dissimilarity that satisfies in addition
the strong hypothesis of the triangle inequality:
$d(x, y) \leq d(x, z) + d(z, y)$). More general edit distances can be
defined, such as for instance the graph edit distance, which
measures distances between graphs [8].

As the hypotheses on $\mathcal{X}$ are minimal, one can no longer
assume that vector calculations are possible in this set.
Then, the learning rules of the SOM do not apply, as they are
based on linear combinations of the prototypes with the data
points. To circumvent this difficulty, [28] suggests choosing
the values of the prototypes $p_c$ in the set of observations $(X_i)_i$.
This leads to a batch version of the SOM which proceeds
as follows. After a random initialization of the prototypes,
each observation is assigned to the neuron with the closest
prototype (according to the dissimilarity measure) and the
prototypes are then updated. For each neuron, the updated
$p_c$ is chosen among the observations as the minimizer of the
following distortion

$$\sum_{i} \Gamma(N(X_i), c) \, d(X_i, p), \qquad (4)$$

where $N(X_i)$ is the neuron of $X_i$ and $\Gamma$ is a decreasing function
of the distance between neurons in the prior structure. This
modification of the SOM algorithm is known as the median
SOM and is closely related to the earlier median version of
the standard k-means algorithm [27].
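
One batch iteration of the median SOM, as described above, can be sketched from a dissimilarity matrix alone. This is a minimal illustration under assumptions: the function name, the Gaussian form of $\Gamma$, the two neuron prior structure and the toy data are all hypothetical choices.

```python
import numpy as np

def median_som_step(D, prototypes, grid_dist, gamma):
    """One batch iteration of the median SOM.

    D: (n, n) symmetric dissimilarity matrix between observations;
    prototypes: (M,) indices of the observations used as prototypes;
    grid_dist: (M, M) distances between neurons in the prior structure;
    gamma: decreasing function applied to those distances.
    """
    # Assignment: N(X_i) is the neuron whose prototype is closest to X_i.
    assign = np.argmin(D[:, prototypes], axis=1)
    # weights[c, i] = Gamma(N(X_i), c)
    weights = gamma(grid_dist[:, assign])
    # distortion[c, j] = sum_i Gamma(N(X_i), c) * d(X_i, X_j), as in equation (4)
    distortion = weights @ D
    # Each new prototype is the observation minimizing the distortion.
    new_prototypes = np.argmin(distortion, axis=1)
    return assign, new_prototypes

# Toy data: two groups on a line, a prior structure with two neurons.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(x[:, None] - x[None, :])
grid_dist = np.array([[0.0, 1.0], [1.0, 0.0]])
gamma = lambda d: np.exp(-d ** 2 / (2 * 0.3 ** 2))
assign, prototypes = median_som_step(D, np.array([0, 3]), grid_dist, gamma)
print(assign, prototypes)
```

The cost of this naive update is quadratic in the number of observations for each neuron, which is why faster implementations are of practical interest.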

In the case where $(X_i)_i$ is a small sample, the constraint
to choose the prototypes in the data can be seen as too strong.
Then, [15] suggests associating several prototypes (a given
number $q$) to each neuron. A neuron is represented by a
subset of size $q$ from $(X_i)_i$ and the different steps of the SOM
algorithm are modified accordingly. A fast implementation
is described in

A successful application of the dissimilarity SOM on real
world data concerns school-to-work transitions. In [39], we
were interested in identifying career-path typologies, which
is a challenging topic for economists working on the labor
market. The data came from the "Generation'98" survey
by the CEREQ. The data sample contained information about
16040 young people who graduated in 1998 and were monitored
during 94 months after leaving school. The labor-market
statuses had nine categories, from permanent contracts to
unemployment, including military service, inactivity and higher
education.

The dissimilarity matrix was computed using optimal
matching distances [1], which are currently mainstream
in economics and sociology. The most striking opposition
appeared between the career paths leading to stable
employment situations and the "chaotic" ones. The stable
positions were mainly situated in the west region of the map.
However, the north and south regions were quite different: in
the north-west region, access to a permanent contract (red)
was achieved after a fixed-term contract (orange), while the
south-west classes were only subject to transitions through
military service (purple) or education (pink). The stability of
the career paths got worse as we moved to the east
of the map. In the north-east region, the initial fixed-term
contract got longer until becoming precarious, while
the south-east region was characterized by exclusion
trajectories: unemployment (light blue) and inactivity (dark
blue).

Two other extensions of the SOM to dissimilarity data
have been proposed; they both avoid the use of constrained
prototypes. The oldest one is based on deterministic annealing
[18] while a more recent one uses the so-called relational
approach that relies on pseudo-Euclidean spaces [20, 23].
Both approaches lead to better results for datasets where the
ratio between the number of observations and the number of
neurons is small.

Fig. 1 Career-path visualization with the dissimilarity SOM [39]: colors
correspond to the nine different categories

4.2 Kernel SOM 

An alternative approach to dissimilarities is to rely on kernels.
Kernels can be seen as a generalization of the notion of
similarity. More precisely, a kernel on a set $\mathcal{X}$ is a symmetric
function $K$ from $\mathcal{X} \times \mathcal{X}$ to $\mathbb{R}$ that satisfies a positivity
property:

$$\forall N \in \mathbb{N}^*, \ \forall (x_i)_{1 \leq i \leq N} \in \mathcal{X}^N, \ \forall (\alpha_i)_{1 \leq i \leq N} \in \mathbb{R}^N, \quad \sum_{i,j=1}^{N} \alpha_i \alpha_j K(x_i, x_j) \geq 0.$$

For such a kernel, there is a Hilbert space $\mathcal{H}$ (called the
feature space of the kernel) and a mapping $\phi$ from $\mathcal{X}$ to $\mathcal{H}$, such
that the inner product in $\mathcal{H}$ corresponds to the kernel via the
mapping, that is:

$$\langle \phi(x), \phi(x') \rangle_{\mathcal{H}} = K(x, x'). \qquad (5)$$

Then $K$ can be interpreted as a similarity on $\mathcal{X}$ (values close
to zero correspond to unrelated objects) and defines indirectly
a distance between objects in $\mathcal{X}$ as follows:

$$d_K(x, x') = \| \phi(x) - \phi(x') \|_{\mathcal{H}} = \sqrt{K(x, x) + K(x', x') - 2 K(x, x')}. \qquad (6)$$
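
The kernel induced distance of equation (6) only requires kernel evaluations, as the following sketch illustrates. The two kernels (linear and Gaussian) and the function names are choices made for the example; with the linear kernel, the feature space is the original space and the distance reduces to the Euclidean one.

```python
import numpy as np

def linear_kernel(x, y):
    """Inner product in the original space."""
    return float(np.dot(x, y))

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel with an assumed bandwidth parameter."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * bandwidth ** 2)))

def kernel_distance(K, x, y):
    """Distance in the feature space, computed with the kernel only (equation (6))."""
    squared = K(x, x) + K(y, y) - 2.0 * K(x, y)
    # Guard against tiny negative values caused by rounding errors.
    return float(np.sqrt(max(squared, 0.0)))

x, y = [3.0, 0.0], [0.0, 4.0]
print(kernel_distance(linear_kernel, x, y))    # Euclidean distance between x and y
print(kernel_distance(gaussian_kernel, x, y))
```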

As shown in e.g. [57], kernels are a very convenient way to
extend standard machine learning methods to arbitrary spaces.
Indeed, the feature space comes with the same elementary
operations as $\mathbb{R}^p$: linear combination, inner product, norm
and distance. Then, one just has to work in the feature space
as if it were the original data space. The only difficulty comes
from the fact that $\phi$ and $\mathcal{H}$ are not explicit in general, mainly
because $\mathcal{H}$ is an infinite dimensional functional space. Then
one has to rely on equation (5) to implement a machine
learning algorithm in $\mathcal{H}$ completely indirectly, using only $K$.
This is the so-called kernel trick.

In the case of the batch version of the SOM, this is quite
simple [6]. Indeed, assignments of data points to neurons
are based on the Euclidean distance in the classical numerical
case: this translates directly into the distance in the feature
space, which is calculated solely using the kernel (see
equation (6)). Prototype updates are performed as weighted
averages of all data points: weights are computed with the
$\Gamma$ function introduced in equation (4) as a proxy for the
prior structure. It can be shown that those weights, which
are computed using the assignments only, are sufficient to
define the prototypes and that they can be plugged into the
distance calculation, without needing an explicit calculation
of $\phi$. Variants of this scheme, especially stochastic ones, have
been studied in [2, 36]. It should also be noted that the relational
approach mentioned in the previous section [20, 23]
can be seen as a relaxed kernel SOM, that is an application of
a similar algorithm in situations where the function $K$ is not
positive.
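
The following sketch illustrates how a batch kernel SOM iteration can be carried out with the Gram matrix only. It is an illustration under assumptions: each prototype is written as $p_c = \sum_j \gamma_{cj} \phi(x_j)$ with normalized weights $\gamma_{cj}$ derived from $\Gamma$, so that $\|\phi(x_i) - p_c\|^2 = K_{ii} - 2 \sum_j \gamma_{cj} K_{ij} + \sum_{j,l} \gamma_{cj} \gamma_{cl} K_{jl}$; the function name, the Gaussian form of $\Gamma$ and the toy data are hypothetical.

```python
import numpy as np

def kernel_som_step(K, assign, grid_dist, gamma):
    """One batch iteration of a kernel SOM using the Gram matrix K only.

    K: (n, n) Gram matrix; assign: (n,) current neuron of each observation;
    grid_dist: (M, M) distances between neurons in the prior structure;
    gamma: decreasing function applied to those distances.
    """
    # Prototype coefficients: weighted average of the phi(x_j), weights from Gamma.
    weights = gamma(grid_dist[:, assign])                    # shape (M, n)
    coef = weights / weights.sum(axis=1, keepdims=True)
    # Squared feature space distances between each phi(x_i) and each prototype.
    cross = K @ coef.T                                       # sum_j coef[c, j] K[i, j]
    self_term = np.einsum("cj,jl,cl->c", coef, K, coef)      # squared norm of p_c
    sq_dist = np.diag(K)[:, None] - 2.0 * cross + self_term[None, :]
    new_assign = np.argmin(sq_dist, axis=1)
    return new_assign, coef, sq_dist

# Toy check with a linear kernel, for which prototypes can be computed explicitly.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
K = X @ X.T
grid_dist = np.array([[0.0, 1.0], [1.0, 0.0]])
gamma = lambda d: np.exp(-d ** 2 / (2 * 0.3 ** 2))
assign = np.array([0, 0, 0, 1, 1, 1])
new_assign, coef, sq_dist = kernel_som_step(K, assign, grid_dist, gamma)
print(new_assign)
```

With the linear kernel, `coef @ X` gives the explicit prototypes, which provides a simple way to check the kernel based distances.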

While kernels are very convenient, the positivity condition
might seem very strong at first. It is indeed much
stronger than the conditions imposed on a dissimilarity, for
instance. Nevertheless, numerous kernels have been defined
on complex data [16], ranging from kernels on strings based
on substrings [35] to kernels between the vertices of a graph
such as the heat kernel [30, 58] (see [6] for a SOM based
application of this kernel to a medieval data set of notarial
acts). Two graphs can also be compared via a kernel based
on random walks ifTTl or on subtree comparisons PHI .



5 Conclusion 

Present day data are becoming more and more complex,
according to several criteria: structure (from simple vector
data to relational data mixing a network structure with
categorical and numerical descriptions), time evolution (from a
fixed snapshot of the data to ever changing dynamical data)
and volume (from small datasets with a handful of variables
and a thousand objects to datasets of terabytes and more).
Adapting artificial neural networks to those new data is a
continuous challenge which can be solved only by mixing
different strategies as outlined in this paper: adding complexity
to the models makes it possible to tackle non standard behavior (such
as non-stationarity), theoretical guarantees limit the risk of
overfitting, new models can be tailor made for some specific
data structures such as graphs or functions, while generic
kernel/dissimilarity models can handle almost any type of data.
The ability to combine all those strategies demonstrates once
again the flexibility of the artificial neural network paradigm.